Modeling Conversational Speech for Speech Recognition
Authors
Abstract
In language modeling for speech recognition, the goal is to constrain the recognizer's search by providing a model that, given a context, indicates which word is most likely to come next. In this paper, we explore how adding information to the text, in particular part-of-speech and dysfluency annotations, can be used to build more complex language models. We ask two questions. First, in conversational speech, where the notion of a "sentence" is less clear than in written text, does segmenting the text into linguistically or semantically based units yield a better language model than segmenting on broad acoustic information alone, such as pauses? Second, is the sentence itself a good unit to model, or should we look at smaller units, for example by dividing a sentence into a "given" and a "new" portion and segmenting out acknowledgments and replies? To answer these questions, we present several kinds of analysis, ranging from vocabulary distributions to language model perplexities. The next step will be modeling conversations and incorporating those models into a speech recognizer.
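As an illustration of the perplexity comparison the abstract mentions, the following is a minimal sketch, not the authors' method. It assumes a bigram model with add-one smoothing, and the function names `train_bigram` and `perplexity`, the toy utterance, and both segmentations are hypothetical. It shows how the same word stream can be scored under a pause-based and a linguistically based segmentation.

```python
import math
from collections import Counter

def train_bigram(segments):
    """Count contexts and bigrams over segments padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for seg in segments:
        tokens = ["<s>"] + seg + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent word pairs
    vocab = {w for seg in segments for w in seg} | {"</s>"}
    return contexts, bigrams, vocab

def perplexity(model, segments):
    """Per-token perplexity under add-one smoothing (an assumption of this sketch)."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot reserves probability mass for unseen words
    log_prob, n = 0.0, 0
    for seg in segments:
        tokens = ["<s>"] + seg + ["</s>"]
        for prev, word in zip(tokens[:-1], tokens[1:]):
            p = (bigrams[(prev, word)] + 1) / (contexts[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Hypothetical toy data: one word stream, segmented two ways.
pause_based = [["yeah", "i", "think", "so", "uh", "we", "went", "there"]]
linguistic = [["yeah"], ["i", "think", "so"], ["uh", "we", "went", "there"]]
held_out = [["yeah"], ["i", "think", "we", "went"]]

for name, train in [("pause-based", pause_based), ("linguistic", linguistic)]:
    print(name, perplexity(train_bigram(train), held_out))
```

Lower perplexity on held-out text means the model finds that text more predictable. On real data the training and held-out sets would be far larger, and a stronger smoothing method would normally replace add-one.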
Similar resources
Rate-of-speech Modeling for Large Vocabulary Conversational Speech Recognition
Variations in rate of speech (ROS) produce changes in both spectral features and word pronunciations that affect automatic speech recognition (ASR) systems. To deal with these ROS effects, we propose to use parallel, rate-specific, acoustic models: one for fast speech, the other for slow speech. Rate switching is permitted at word boundaries, to allow modeling within-sentence speech rate variat...
Rate-dependent Acoustic Modeling for Large Vocabulary Conversational Speech Recognition
Variations in rate of speech (ROS) produce changes in both spectral features and word pronunciations that affect automatic speech recognition (ASR) systems. To deal with these ROS effects, we propose to use parallel, rate-specific, acoustic models: one for fast speech, the other for slow speech. Rate switching is permitted at word boundaries, to allow modeling within-sentence speech rate variat...
Improved MLLR speaker adaptation using confidence measures for conversational speech recognition
Automatic recognition of conversational speech tends to have higher word error rates (WER) than read speech. Improvements gained from unsupervised speaker adaptation methods like Maximum Likelihood Linear Regression (MLLR) [1] are reduced because of their sensitivity to recognition errors in the first pass. We show that a more detailed modeling of adaptation classes and the use of confidence me...
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
Recognition of conversational telephone speech using the JANUS speech engine
Torsten Zeppenfeld, Michael Finke, Klaus Ries, Martin Westphal, Alex Waibel (Interactive Systems Laboratories; Carnegie Mellon University, USA; University of Karlsruhe, Germany). Recognition of conversational speech is one of the most challenging speech recognition tasks to date. While recognition error rates of 10% or lower can now be reached on speech dictation tasks over ...
Enhanced tree clustering with single pronunciation dictionary for conversational speech recognition
Modeling pronunciation variation is key for recognizing conversational speech. Rather than being limited to dictionary modeling, we argue that triphone clustering is an integral part of pronunciation modeling. We propose a new approach called enhanced tree clustering. This approach, in contrast to traditional decision tree based state tying, allows parameter sharing across phonemes. We show tha...